Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service leads to a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave the service, and understand the reasons why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not give up their credit cards.
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imblearn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# This will help in making the Python code more structured automatically (good coding practice)
# %load_ext nb_black
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To tune model, get different metric scores, and split data
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To impute missing values
from sklearn.impute import SimpleImputer
from sklearn import metrics
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
bank = pd.read_csv("BankChurners.csv")
bank.shape
(10127, 21)
The dataset has 10127 rows and 21 columns
data = bank.copy()
data.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
data.tail()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.000 | 1851 | 2152.000 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.000 | 2186 | 2091.000 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.000 | 0 | 5409.000 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.000 | 0 | 5281.000 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.000 | 1961 | 8427.000 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
# check for duplicate values in the data
data.duplicated().sum()
0
#missing values in the data
round(data.isnull().sum() / data.isnull().count() * 100, 2)
CLIENTNUM                   0.000
Attrition_Flag              0.000
Customer_Age                0.000
Gender                      0.000
Dependent_count             0.000
Education_Level            15.000
Marital_Status              7.400
Income_Category             0.000
Card_Category               0.000
Months_on_book              0.000
Total_Relationship_Count    0.000
Months_Inactive_12_mon      0.000
Contacts_Count_12_mon       0.000
Credit_Limit                0.000
Total_Revolving_Bal         0.000
Avg_Open_To_Buy             0.000
Total_Amt_Chng_Q4_Q1        0.000
Total_Trans_Amt             0.000
Total_Trans_Ct              0.000
Total_Ct_Chng_Q4_Q1         0.000
Avg_Utilization_Ratio       0.000
dtype: float64
Education_Level has 15% missing values.
Marital_Status has 7.4% missing values.
All other columns have no missing values.
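The missing-percentage computation used above can be reproduced on a tiny hypothetical frame (illustrative values, not the Thera data):

```python
import pandas as pd

# Tiny hypothetical frame: two of four Education_Level entries are missing
toy = pd.DataFrame({
    "Education_Level": ["Graduate", None, "College", None],
    "Customer_Age": [45, 49, 51, 40],
})

# Same formula as above: NaNs per column divided by total rows, as a percentage
missing_pct = round(toy.isnull().sum() / toy.isnull().count() * 100, 2)
print(missing_pct["Education_Level"])  # 50.0
print(missing_pct["Customer_Age"])     # 0.0
```

Note that `isnull().count()` counts all entries (null or not), so it is simply the number of rows per column.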
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
CLIENTNUM is a unique identifier with no statistical importance, so it can be dropped.
Customer ages range from 26 to 73.
The longest tenure with the bank is 56 months.
50% of the customers have at least 2 dependents.
# categorical variables
cat_col = data.select_dtypes(include="object").columns.tolist()
data.drop(["CLIENTNUM"], axis=1, inplace=True) #Drop clientnum as it is unique per customer and has no relation to the target variable.
data["Attrition_Flag"].replace("Attrited Customer", 1, inplace=True)
data["Attrition_Flag"].replace("Existing Customer", 0, inplace=True)
#create copy of data
data1 = data.copy()
Questions:
- How does Total_Ct_Chng_Q4_Q1 vary by the customer's account status (Attrition_Flag)?
- How does Months_Inactive_12_mon vary by the customer's account status (Attrition_Flag)?
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
if bins:
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    )  # histogram with the specified number of bins
else:
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram with default bins
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place legend outside the plot area
plt.show()
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
num_col_sel = data.select_dtypes(include=np.number).columns.tolist()
for item in num_col_sel:
histogram_boxplot(data, item)
The mean Credit_Limit (~8,632) is well above the median (~4,549), indicating a right-skewed distribution.
The median for Customer Age is 46.
data['Attrition_Flag'].nunique()
2
data['Attrition_Flag'].value_counts()
0    8500
1    1627
Name: Attrition_Flag, dtype: int64
8500 customers still have an account, while 1627 customers have closed their accounts.
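With 1627 attrited out of 10127 customers, the target is imbalanced at roughly 5:1, which is why imblearn's samplers were imported earlier. A quick check of the rate from the counts above:

```python
# Counts taken from the value_counts() output above
existing, attrited = 8500, 1627
total = existing + attrited

attrition_rate = attrited / total
print(f"attrition rate: {attrition_rate:.1%}")          # attrition rate: 16.1%
print(f"imbalance ratio: {existing / attrited:.1f}:1")  # imbalance ratio: 5.2:1
```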
labeled_barplot(data,'Attrition_Flag', perc=True)
data['Customer_Age'].nunique()
45
data['Customer_Age'].value_counts()
44 500 49 495 46 490 45 486 47 479 43 473 48 472 50 452 42 426 51 398 53 387 41 379 52 376 40 361 39 333 54 307 38 303 55 279 56 262 37 260 57 223 36 221 35 184 59 157 58 157 34 146 33 127 60 127 32 106 65 101 61 93 62 93 31 91 26 78 30 70 63 65 29 56 64 43 27 32 28 29 67 4 66 2 68 2 70 1 73 1 Name: Customer_Age, dtype: int64
The mode for Customer Age is 44
sns.boxplot(data=data,x='Customer_Age')
#Boxplot to show the distribution of Customer_Age
<Axes: xlabel='Customer_Age'>
More than 50% of the clients fall under 50 years. There are few outliers on the upper end.
data['Gender'].nunique()
2
data['Gender'].value_counts()
F 5358 M 4769 Name: Gender, dtype: int64
labeled_barplot(data,'Gender', perc=True)
data['Dependent_count'].nunique()
6
data['Dependent_count'].value_counts()
3 2732 2 2655 1 1838 4 1574 0 904 5 424 Name: Dependent_count, dtype: int64
labeled_barplot(data,'Dependent_count', perc=True)
More than 50% of the customers have at least 2 dependents.
At least 90% of the customers have at least one dependent.
data['Education_Level'].nunique()
6
data['Education_Level'].value_counts()
Graduate 3128 High School 2013 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: Education_Level, dtype: int64
labeled_barplot(data,'Education_Level', perc=True)
30.9% are graduates.
Only 4.5% have a doctorate.
More than half have at least a college-level education.
data['Marital_Status'].nunique()
3
data['Marital_Status'].value_counts()
Married 4687 Single 3943 Divorced 748 Name: Marital_Status, dtype: int64
labeled_barplot(data,'Marital_Status', perc=True)
46.3% of the customers are married.
Only 7.4% of the customers are divorced.
data['Income_Category'].nunique()
6
data['Income_Category'].value_counts()
Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: Income_Category, dtype: int64
labeled_barplot(data,'Income_Category', perc=True)
35.2% of the customers earn less than $40K.
There is an anomalous value "abc" which accounts for 11% of the values.
data['Card_Category'].nunique()
4
data['Card_Category'].value_counts()
Blue 9436 Silver 555 Gold 116 Platinum 20 Name: Card_Category, dtype: int64
labeled_barplot(data,'Card_Category', perc=True)
93.2% of the customers have Blue cards.
Only 0.2% have Platinum cards.
data['Months_on_book'].nunique()
44
data['Months_on_book'].value_counts()
36 2463 37 358 34 353 38 347 39 341 40 333 31 318 35 317 33 305 30 300 41 297 32 289 28 275 43 273 42 271 29 241 44 230 45 227 27 206 46 197 26 186 47 171 25 165 48 162 24 160 49 141 23 116 22 105 56 103 50 96 21 83 51 80 53 78 20 74 13 70 19 63 52 62 18 58 54 53 55 42 17 39 15 34 16 29 14 16 Name: Months_on_book, dtype: int64
sns.boxplot(data=data,x='Months_on_book')
#Boxplot to show the distribution of Months_on_book
<Axes: xlabel='Months_on_book'>
More than 75% of the customers have been with the bank for at least 30 months.
data['Total_Relationship_Count'].nunique()
6
data['Total_Relationship_Count'].value_counts()
3 2305 4 1912 5 1891 6 1866 2 1243 1 910 Name: Total_Relationship_Count, dtype: int64
labeled_barplot(data,'Total_Relationship_Count', perc=True)
22.8% have held 3 products.
data['Months_Inactive_12_mon'].nunique()
7
data['Months_Inactive_12_mon'].value_counts()
3 3846 2 3282 1 2233 4 435 5 178 6 124 0 29 Name: Months_Inactive_12_mon, dtype: int64
sns.boxplot(data=data,x='Months_Inactive_12_mon')
#Boxplot to show the distribution of Months_Inactive_12_mon
<Axes: xlabel='Months_Inactive_12_mon'>
50% of the customers have been inactive for 2-3 months. Only 29 customers were never inactive.
data['Contacts_Count_12_mon'].nunique()
7
data['Contacts_Count_12_mon'].value_counts()
3 3380 2 3227 1 1499 4 1392 0 399 5 176 6 54 Name: Contacts_Count_12_mon, dtype: int64
sns.boxplot(data=data,x='Contacts_Count_12_mon')
#Boxplot to show the distribution of Contacts_Count_12_mon
<Axes: xlabel='Contacts_Count_12_mon'>
50% had 2-3 contacts.
data['Credit_Limit'].nunique()
6205
data['Credit_Limit'].value_counts()
34516.000 508
1438.300 507
9959.000 18
15987.000 18
23981.000 12
...
9183.000 1
29923.000 1
9551.000 1
11558.000 1
10388.000 1
Name: Credit_Limit, Length: 6205, dtype: int64
sns.boxplot(data=data,x='Credit_Limit')
#Boxplot to show the distribution of Credit_Limit
<Axes: xlabel='Credit_Limit'>
50% of the customers have a Credit_Limit below 5,000, but there are many outliers on the upper end.
data['Total_Revolving_Bal'].nunique()
1974
data['Total_Revolving_Bal'].value_counts()
0 2470
2517 508
1965 12
1480 12
1434 11
...
2467 1
2131 1
2400 1
2144 1
2241 1
Name: Total_Revolving_Bal, Length: 1974, dtype: int64
sns.boxplot(data=data,x='Total_Revolving_Bal')
#Boxplot to show the distribution of Total_Revolving_Bal
<Axes: xlabel='Total_Revolving_Bal'>
data['Avg_Open_To_Buy'].nunique()
6813
data['Avg_Open_To_Buy'].value_counts()
1438.300 324
34516.000 98
31999.000 26
787.000 8
701.000 7
...
6543.000 1
2808.000 1
21549.000 1
6189.000 1
8427.000 1
Name: Avg_Open_To_Buy, Length: 6813, dtype: int64
sns.boxplot(data=data,x='Avg_Open_To_Buy')
#Boxplot to show the distribution of Avg_Open_To_Buy
<Axes: xlabel='Avg_Open_To_Buy'>
data['Total_Amt_Chng_Q4_Q1'].nunique()
1158
data['Total_Amt_Chng_Q4_Q1'].value_counts()
0.791 36
0.712 34
0.743 34
0.718 33
0.735 33
..
1.216 1
1.645 1
1.089 1
2.103 1
0.166 1
Name: Total_Amt_Chng_Q4_Q1, Length: 1158, dtype: int64
sns.boxplot(data=data,x='Total_Amt_Chng_Q4_Q1')
#Boxplot to show the distribution of Total_Amt_Chng_Q4_Q1
<Axes: xlabel='Total_Amt_Chng_Q4_Q1'>
data['Total_Trans_Amt'].nunique()
5033
data['Total_Trans_Amt'].value_counts()
4253 11
4509 11
4518 10
2229 10
4220 9
..
1274 1
4521 1
3231 1
4394 1
10294 1
Name: Total_Trans_Amt, Length: 5033, dtype: int64
sns.boxplot(data=data,x='Total_Trans_Amt')
#Boxplot to show the distribution of Total_Trans_Amt
<Axes: xlabel='Total_Trans_Amt'>
75% of the customers have a total transaction amount of less than 5,000, but there are many outliers on the higher end.
data['Total_Trans_Ct'].nunique()
126
data['Total_Trans_Ct'].value_counts()
81 208
71 203
75 203
69 202
82 202
...
11 2
134 1
139 1
138 1
132 1
Name: Total_Trans_Ct, Length: 126, dtype: int64
sns.boxplot(data=data,x='Total_Trans_Ct')
#Boxplot to show the distribution of Total_Trans_Ct
<Axes: xlabel='Total_Trans_Ct'>
data['Total_Ct_Chng_Q4_Q1'].nunique()
830
data['Total_Ct_Chng_Q4_Q1'].value_counts()
0.667 171
1.000 166
0.500 161
0.750 156
0.600 113
...
0.827 1
0.343 1
1.579 1
0.125 1
0.359 1
Name: Total_Ct_Chng_Q4_Q1, Length: 830, dtype: int64
sns.boxplot(data=data,x='Total_Ct_Chng_Q4_Q1')
#Boxplot to show the distribution of Total_Ct_Chng_Q4_Q1
<Axes: xlabel='Total_Ct_Chng_Q4_Q1'>
data['Avg_Utilization_Ratio'].nunique()
964
data['Avg_Utilization_Ratio'].value_counts()
0.000 2470
0.073 44
0.057 33
0.048 32
0.060 30
...
0.927 1
0.935 1
0.954 1
0.385 1
0.009 1
Name: Avg_Utilization_Ratio, Length: 964, dtype: int64
sns.boxplot(data=data,x='Avg_Utilization_Ratio')
#Boxplot to show the distribution of Avg_Utilization_Ratio
<Axes: xlabel='Avg_Utilization_Ratio'>
num_col = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 7))
sns.heatmap(data[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Credit_Limit and Avg_Open_To_Buy are almost perfectly correlated, since Avg_Open_To_Buy is essentially the credit limit minus the revolving balance. Total_Trans_Amt and Total_Trans_Ct have a strong positive correlation. Months_on_book and Customer_Age also correlate relatively strongly. There is a negative correlation between Total_Trans_Amt and Attrition_Flag.
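Rather than reading the heatmap by eye, the strongest pairs can be ranked programmatically; a minimal sketch on a toy frame (columns `a`, `b`, `c` are hypothetical, not the bank data):

```python
import numpy as np
import pandas as pd

# Toy numeric frame where b is (almost) a linear function of a
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.2, 10.0],  # roughly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr().abs()
# Mask the lower triangle and diagonal so each pair appears once, then rank
pairs = (
    corr.where(~np.tril(np.ones(corr.shape, dtype=bool)))
    .stack()
    .sort_values(ascending=False)
)
print(pairs.index[0])  # ('a', 'b')
```

Applied to `data[num_col]`, the same recipe would surface the Credit_Limit / Avg_Open_To_Buy pair at the top.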
sns.pairplot(data=data[num_col], diag_kind="kde")
plt.show()
distribution_plot_wrt_target(data, "Customer_Age", "Attrition_Flag")
stacked_barplot(data, "Gender", "Attrition_Flag")
Attrition_Flag 0 1 All Gender All 8500 1627 10127 F 4428 930 5358 M 4072 697 4769 ------------------------------------------------------------------------------------------------------------------------
Both genders have similar attrition rates, with females slightly higher (17.4% vs. 14.6% for males).
stacked_barplot(data, "Dependent_count", "Attrition_Flag")
Attrition_Flag 0 1 All Dependent_count All 8500 1627 10127 3 2250 482 2732 2 2238 417 2655 1 1569 269 1838 4 1314 260 1574 0 769 135 904 5 360 64 424 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Education_Level", "Attrition_Flag")
Attrition_Flag 0 1 All Education_Level All 7237 1371 8608 Graduate 2641 487 3128 High School 1707 306 2013 Uneducated 1250 237 1487 College 859 154 1013 Doctorate 356 95 451 Post-Graduate 424 92 516 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Marital_Status", "Attrition_Flag")
Attrition_Flag 0 1 All Marital_Status All 7880 1498 9378 Married 3978 709 4687 Single 3275 668 3943 Divorced 627 121 748 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Income_Category", "Attrition_Flag")
Attrition_Flag 0 1 All Income_Category All 8500 1627 10127 Less than $40K 2949 612 3561 $40K - $60K 1519 271 1790 $80K - $120K 1293 242 1535 $60K - $80K 1213 189 1402 abc 925 187 1112 $120K + 601 126 727 ------------------------------------------------------------------------------------------------------------------------
612 of the 1,627 attrited customers (more than a third) earn less than $40K.
stacked_barplot(data, "Card_Category", "Attrition_Flag")
Attrition_Flag 0 1 All Card_Category All 8500 1627 10127 Blue 7917 1519 9436 Silver 473 82 555 Gold 95 21 116 Platinum 15 5 20 ------------------------------------------------------------------------------------------------------------------------
Most attrited customers hold blue cards.
distribution_plot_wrt_target(data, "Months_on_book", "Attrition_Flag")
stacked_barplot(data, "Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag 0 1 All Total_Relationship_Count All 8500 1627 10127 3 1905 400 2305 2 897 346 1243 1 677 233 910 5 1664 227 1891 4 1687 225 1912 6 1670 196 1866 ------------------------------------------------------------------------------------------------------------------------
More than 50% of the attrited customers held 3 or fewer products.
stacked_barplot(data, "Months_Inactive_12_mon", "Attrition_Flag")
Attrition_Flag 0 1 All Months_Inactive_12_mon All 8500 1627 10127 3 3020 826 3846 2 2777 505 3282 4 305 130 435 1 2133 100 2233 5 146 32 178 6 105 19 124 0 14 15 29 ------------------------------------------------------------------------------------------------------------------------
826 of the 1,627 attrited customers had been inactive for 3 months. More than 50% of the attrited customers were inactive for at least 3 months, while over 90% of the existing customers were inactive for 3 months or fewer.
stacked_barplot(data, "Contacts_Count_12_mon", "Attrition_Flag")
Attrition_Flag 0 1 All Contacts_Count_12_mon All 8500 1627 10127 3 2699 681 3380 2 2824 403 3227 4 1077 315 1392 1 1391 108 1499 5 117 59 176 6 0 54 54 0 392 7 399 ------------------------------------------------------------------------------------------------------------------------
All 54 customers with the highest contact count (6) attrited.
distribution_plot_wrt_target(data, "Credit_Limit", "Attrition_Flag")
distribution_plot_wrt_target(data, "Total_Revolving_Bal", "Attrition_Flag")
Most attrited customers had a total revolving balance below $1,500, while more than 50% of the existing customers had a revolving balance above $1,000.
distribution_plot_wrt_target(data, "Avg_Open_To_Buy", "Attrition_Flag")
Both distributions are similar; there is no visible pattern that differentiates attrited customers based on Avg_Open_To_Buy.
distribution_plot_wrt_target(data, "Total_Amt_Chng_Q4_Q1", "Attrition_Flag")
There are many outliers on the upper end for existing customers, while the outliers are more balanced for attrited customers. Total_Amt_Chng_Q4_Q1 is less than 1 for most attrited customers.
distribution_plot_wrt_target(data, "Total_Trans_Amt", "Attrition_Flag")
Attrited customers have a lower Total_Trans_Amt distribution than existing customers.
distribution_plot_wrt_target(data, "Total_Trans_Ct", "Attrition_Flag")
distribution_plot_wrt_target(data, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag")
distribution_plot_wrt_target(data, "Avg_Utilization_Ratio", "Attrition_Flag")
data2=data.copy()
IQR = data2.quantile(0.75) - data2.quantile(0.25)  # interquartile range
lower_bound = data2.quantile(0.25) - 1.5 * IQR  # establish lower bound
upper_bound = data2.quantile(0.75) + 1.5 * IQR  # establish upper bound
outlier = (
    (data2.select_dtypes(include=["float64", "int64"]) < lower_bound)
    | (data2.select_dtypes(include=["float64", "int64"]) > upper_bound)
).sum()
outlier / len(data2) * 100
Attrition_Flag              16.066
Customer_Age                 0.020
Dependent_count              0.000
Months_on_book               3.812
Total_Relationship_Count     0.000
Months_Inactive_12_mon       3.268
Contacts_Count_12_mon        6.211
Credit_Limit                 9.717
Total_Revolving_Bal          0.000
Avg_Open_To_Buy              9.509
Total_Amt_Chng_Q4_Q1         3.910
Total_Trans_Amt              8.848
Total_Trans_Ct               0.020
Total_Ct_Chng_Q4_Q1          3.891
Avg_Utilization_Ratio        0.000
dtype: float64
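The 1.5×IQR rule applied above can be illustrated on a small hypothetical series with one obvious outlier:

```python
import pandas as pd

# Hypothetical values, not from the bank data
s = pd.Series([10, 12, 11, 13, 12, 11, 14, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1                                    # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # whisker bounds

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]
```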
data2["Income_Category"].replace("abc", np.nan, inplace = True)
X = data2.drop(["Attrition_Flag"], axis=1)
y = data2["Attrition_Flag"]
# Splitting data into training, validation and test set:
# first split data into 2 parts
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then split the first set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in validation data =", X_val.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 6075
Number of rows in validation data = 2026
Number of rows in test data = 2026
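Stratifying both splits keeps the ~16% attrition rate the same in train, validation, and test. A toy check on synthetic labels (not the bank data) of the same two-step 60/20/20 scheme:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic 15%-positive labels, mimicking the class imbalance
y = np.array([0] * 85 + [1] * 15)
X = np.arange(100).reshape(-1, 1)

# Same scheme as above: 20% test, then 25% of the remainder as validation
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=1, stratify=y_tmp
)

print(y_tr.mean(), y_va.mean(), y_te.mean())  # 0.15 in each split
```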
data2.isnull().sum()
Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 1112 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
fr_imputer = SimpleImputer(strategy = 'most_frequent')
col_missing = ["Education_Level", "Marital_Status", "Income_Category"]
X_train[col_missing] = fr_imputer.fit_transform(X_train[col_missing])
X_val[col_missing] = fr_imputer.transform(X_val[col_missing])
X_test[col_missing] = fr_imputer.transform(X_test[col_missing])
# Checking that no column has missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
cols = X_train.select_dtypes(include=["object", "category"]) ##Training set
for i in cols.columns:
print(X_train[i].value_counts())
print("*" * 40)
F 3193 M 2882 Name: Gender, dtype: int64 **************************************** Graduate 2782 High School 1228 Uneducated 881 College 618 Post-Graduate 312 Doctorate 254 Name: Education_Level, dtype: int64 **************************************** Married 3276 Single 2369 Divorced 430 Name: Marital_Status, dtype: int64 **************************************** Less than $40K 2783 $40K - $60K 1059 $80K - $120K 953 $60K - $80K 831 $120K + 449 Name: Income_Category, dtype: int64 **************************************** Blue 5655 Silver 339 Gold 69 Platinum 12 Name: Card_Category, dtype: int64 ****************************************
cols = X_val.select_dtypes(include=["object", "category"])  # Validation set
for i in cols.columns:
    print(X_val[i].value_counts())
    print("*" * 40)
F    1095
M     931
Name: Gender, dtype: int64
****************************************
Graduate         917
High School      404
Uneducated       306
College          199
Post-Graduate    101
Doctorate         99
Name: Education_Level, dtype: int64
****************************************
Married     1100
Single       770
Divorced     156
Name: Marital_Status, dtype: int64
****************************************
Less than $40K    957
$40K - $60K       361
$80K - $120K      293
$60K - $80K       279
$120K +           136
Name: Income_Category, dtype: int64
****************************************
Blue        1905
Silver        97
Gold          21
Platinum       3
Name: Card_Category, dtype: int64
****************************************
cols = X_test.select_dtypes(include=["object", "category"])  # Test set
for i in cols.columns:
    print(X_test[i].value_counts())
    print("*" * 40)
F    1070
M     956
Name: Gender, dtype: int64
****************************************
Graduate         948
High School      381
Uneducated       300
College          196
Post-Graduate    103
Doctorate         98
Name: Education_Level, dtype: int64
****************************************
Married     1060
Single       804
Divorced     162
Name: Marital_Status, dtype: int64
****************************************
Less than $40K    933
$40K - $60K       370
$60K - $80K       292
$80K - $120K      289
$120K +           142
Name: Income_Category, dtype: int64
****************************************
Blue        1876
Silver       119
Gold          26
Platinum       5
Name: Card_Category, dtype: int64
****************************************
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 29) (2026, 29) (2026, 29)
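The three splits end up with identical columns here because every category level appears in all three splits (as the value counts above confirm). In general, though, calling `pd.get_dummies` separately on each split can silently produce mismatched columns when a level is absent from one split. A minimal sketch of guarding against that with `reindex` (toy data and a hypothetical `Card` column, not the bank dataset):

```python
import pandas as pd

# Toy splits: "Gold" never appears in the validation split,
# so separate get_dummies calls yield different column sets.
train = pd.DataFrame({"Card": ["Blue", "Silver", "Gold"]})
val = pd.DataFrame({"Card": ["Blue", "Silver", "Silver"]})

train_d = pd.get_dummies(train, drop_first=True)  # Card_Gold, Card_Silver
val_d = pd.get_dummies(val, drop_first=True)      # Card_Silver only

# Align validation dummies to the training columns,
# filling levels unseen in validation with 0.
val_d = val_d.reindex(columns=train_d.columns, fill_value=0)
```

After the `reindex`, both frames share the exact same column order, which is what downstream sklearn estimators expect.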
# check the top 5 rows from the train dataset
X_train.head()
| | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender_M | Education_Level_Doctorate | Education_Level_Graduate | Education_Level_High School | Education_Level_Post-Graduate | Education_Level_Uneducated | Marital_Status_Married | Marital_Status_Single | Income_Category_$40K - $60K | Income_Category_$60K - $80K | Income_Category_$80K - $120K | Income_Category_Less than $40K | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 800 | 40 | 2 | 21 | 6 | 4 | 3 | 20056.000 | 1602 | 18454.000 | 0.466 | 1687 | 46 | 0.533 | 0.080 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 498 | 44 | 1 | 34 | 6 | 2 | 0 | 2885.000 | 1895 | 990.000 | 0.387 | 1366 | 31 | 0.632 | 0.657 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4356 | 48 | 4 | 36 | 5 | 1 | 2 | 6798.000 | 2517 | 4281.000 | 0.873 | 4327 | 79 | 0.881 | 0.370 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 407 | 41 | 2 | 36 | 6 | 2 | 0 | 27000.000 | 0 | 27000.000 | 0.610 | 1209 | 39 | 0.300 | 0.000 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 8728 | 46 | 4 | 36 | 2 | 2 | 3 | 15034.000 | 1356 | 13678.000 | 0.754 | 7737 | 84 | 0.750 | 0.090 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
The nature of predictions made by the classification model will translate as follows:

- A false negative means the model predicts a customer will keep the card, but the customer attrites: the bank loses the customer (and the associated fee income) without a chance to intervene.
- A false positive means the model predicts a customer will attrite, but the customer stays: the bank spends retention effort on a customer who was not at risk.

Which metric to optimize?

Losing an existing customer is costlier to the bank than an unnecessary retention offer, so false negatives should be minimized. We will therefore optimize for Recall.
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )

    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
Sample code for model building with original data
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))
print("\nTraining Performance:\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores))

print("\nValidation Performance:\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance:

Bagging: 0.9774590163934426
Random forest: 1.0
GBM: 0.875
Adaboost: 0.826844262295082
dtree: 1.0

Validation Performance:

Bagging: 0.7699386503067485
Random forest: 0.7484662576687117
GBM: 0.8558282208588958
Adaboost: 0.852760736196319
dtree: 0.7944785276073619
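A side note on the `class_weight='balanced'` setting used for the tree-based models above: it weights each class inversely to its frequency, w_c = n_samples / (n_classes * n_c), so errors on the rare attrited class cost more during training. A quick sketch using the training-set class counts from this notebook (5099 retained, 976 attrited):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild the training label distribution: 5099 retained (0), 976 attrited (1)
y = np.array([0] * 5099 + [1] * 976)

# 'balanced' weights: n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)  # minority class gets ~5x the weight of the majority class
```

Here the attrited class gets a weight of about 3.11 versus about 0.60 for the retained class, which is what pushes the weighted models toward higher recall.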
print("\nTraining and Validation Performance Difference:\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores_train = recall_score(y_train, model.predict(X_train))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference1 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference1))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9775, Validation Score: 0.7699, Difference: 0.2075
Random forest: Training Score: 1.0000, Validation Score: 0.7485, Difference: 0.2515
GBM: Training Score: 0.8750, Validation Score: 0.8558, Difference: 0.0192
Adaboost: Training Score: 0.8268, Validation Score: 0.8528, Difference: -0.0259
dtree: Training Score: 1.0000, Validation Score: 0.7945, Difference: 0.2055
GBM generalized best, with a difference of only 0.0192 between its training and validation recall. AdaBoost also generalized well; its validation recall was slightly higher than its training recall.
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE(
sampling_strategy=1, k_neighbors=5, random_state=1
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 976
Before Oversampling, counts of label 'No': 5099

After Oversampling, counts of label 'Yes': 5099
After Oversampling, counts of label 'No': 5099

After Oversampling, the shape of train_X: (10198, 29)
After Oversampling, the shape of train_y: (10198,)
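Conceptually, SMOTE synthesizes each new minority point by interpolating between an existing minority sample and one of its k nearest minority neighbours: x_new = x + gap * (neighbour - x), with gap drawn uniformly from [0, 1]. A minimal numpy sketch of that interpolation step (toy data; not imbalanced-learn's actual implementation):

```python
import numpy as np

def smote_point(x, neighbors, rng):
    """Interpolate between x and one randomly chosen neighbor:
    x_new = x + gap * (neighbor - x), with gap ~ Uniform(0, 1)."""
    nb = neighbors[rng.integers(len(neighbors))]
    gap = rng.random()
    return x + gap * (nb - x)

rng = np.random.default_rng(1)

# One minority sample and its k=3 nearest minority neighbors (toy values)
x = np.array([1.0, 2.0])
neighbors = np.array([[1.5, 2.5], [0.5, 1.5], [1.0, 3.0]])

synthetic = smote_point(x, neighbors, rng)
# The synthetic point lies on the segment between x and the chosen neighbor
```

Because every synthetic point sits between two real minority points, SMOTE densifies the minority region rather than duplicating rows, which is why it usually overfits less than naive oversampling with replacement.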
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))
print("\nTraining Performance:\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    print("{}: {}".format(name, scores))

print("\nValidation Performance:\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Training Performance:

Bagging: 0.9974504804863699
Random forest: 1.0
GBM: 0.980976662090606
Adaboost: 0.9690135320651108
dtree: 1.0

Validation Performance:

Bagging: 0.8496932515337423
Random forest: 0.8680981595092024
GBM: 0.8926380368098159
Adaboost: 0.901840490797546
dtree: 0.8251533742331288
print("\nTraining and Validation Performance Difference:\n")

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores_train = recall_score(y_train_over, model.predict(X_train_over))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference2 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference2))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9975, Validation Score: 0.8497, Difference: 0.1478
Random forest: Training Score: 1.0000, Validation Score: 0.8681, Difference: 0.1319
GBM: Training Score: 0.9810, Validation Score: 0.8926, Difference: 0.0883
Adaboost: Training Score: 0.9690, Validation Score: 0.9018, Difference: 0.0672
dtree: Training Score: 1.0000, Validation Score: 0.8252, Difference: 0.1748
AdaBoost and GBM again performed the best, although with larger train-validation gaps than their counterparts trained on the original data.
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 976
Before Under Sampling, counts of label 'No': 5099

After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976

After Under Sampling, the shape of train_X: (1952, 29)
After Under Sampling, the shape of train_y: (1952,)
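With `sampling_strategy=1`, random undersampling simply keeps every minority row and a same-sized random subset of the majority rows. The idea can be sketched in plain pandas (toy data; `RandomUnderSampler` handles this internally):

```python
import pandas as pd

# Toy frame: 8 majority rows (y=0) and 2 minority rows (y=1)
df = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})

minority = df[df["y"] == 1]
majority = df[df["y"] == 0].sample(n=len(minority), random_state=1)

# Concatenate and shuffle to get a balanced training frame
balanced = pd.concat([majority, minority]).sample(frac=1, random_state=1)
print(balanced["y"].value_counts())
```

The cost of this approach is visible in the shapes above: the balanced training set shrinks from 6075 to 1952 rows, discarding most of the majority-class information.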
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))
print("\nTraining Performance:\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))

print("\nValidation Performance:\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Training Performance:

Bagging: 0.9907786885245902
Random forest: 1.0
GBM: 0.9805327868852459
Adaboost: 0.9528688524590164
dtree: 1.0

Validation Performance:

Bagging: 0.9294478527607362
Random forest: 0.9386503067484663
GBM: 0.9570552147239264
Adaboost: 0.9601226993865031
dtree: 0.9202453987730062
print("\nTraining and Validation Performance Difference:\n")

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores_train = recall_score(y_train_un, model.predict(X_train_un))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference3 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference3))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9908, Validation Score: 0.9294, Difference: 0.0613
Random forest: Training Score: 1.0000, Validation Score: 0.9387, Difference: 0.0613
GBM: Training Score: 0.9805, Validation Score: 0.9571, Difference: 0.0235
Adaboost: Training Score: 0.9529, Validation Score: 0.9601, Difference: -0.0073
dtree: Training Score: 1.0000, Validation Score: 0.9202, Difference: 0.0798
AdaBoost performed the best; its validation recall was slightly higher than its training recall. GBM also performed well, with a difference of only 0.0235.
Since AdaBoost trained on the original data, AdaBoost trained on the undersampled data, and all three GBM models performed the best and had the smallest differences between their training and validation scores, we will perform hyperparameter tuning on those 5 models.
%%time
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid ={
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.8360596546310832:
CPU times: user 3.46 s, sys: 327 ms, total: 3.78 s
Wall time: 1min 40s
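Note that this AdaBoost grid contains only 3 × 3 × 2 = 18 parameter combinations, fewer than `n_iter=50`, so the randomized search effectively evaluates the full grid (scikit-learn samples the whole grid without replacement and emits a warning in this case). A quick check of the grid size (the integer `max_depth` list stands in for the two candidate base-estimator trees):

```python
from itertools import product

import numpy as np

param_grid = {
    "n_estimators": np.arange(50, 110, 25),  # 50, 75, 100
    "learning_rate": [0.01, 0.1, 0.05],
    "max_depth": [2, 3],  # stands in for the two DecisionTreeClassifier options
}

# Total number of distinct parameter combinations in the grid
n_combinations = len(list(product(*param_grid.values())))
print(n_combinations)  # 18
```

With only 18 combinations, a plain `GridSearchCV` would give identical results here; `RandomizedSearchCV` only pays off for the larger grids used later.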
tuned_adb = AdaBoostClassifier(
random_state=1,
n_estimators=100,
learning_rate=0.1,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_adb.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
# Check model performance on training set
adb_train = model_performance_classification_sklearn(tuned_adb, X_train, y_train)
adb_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.982 | 0.927 | 0.961 | 0.944 |
# Check model performance on validation set
adb_val = model_performance_classification_sklearn(tuned_adb, X_val, y_val)
adb_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.967 | 0.856 | 0.933 | 0.893 |
%%time
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid ={
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9467346938775512:
CPU times: user 1.44 s, sys: 92.8 ms, total: 1.54 s
Wall time: 42.6 s
tuned_adb1 = AdaBoostClassifier(
random_state=1,
n_estimators=100,
learning_rate=0.05,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_adb1.fit(X_train_un, y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
                   learning_rate=0.05, n_estimators=100, random_state=1)
# Check model performance on training set
adb1_train = model_performance_classification_sklearn(tuned_adb1, X_train_un, y_train_un)
adb1_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.973 | 0.978 | 0.968 | 0.973 |
# Check model performance on validation set
adb1_val = model_performance_classification_sklearn(tuned_adb1, X_val, y_val)
adb1_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.937 | 0.966 | 0.731 | 0.832 |
%%time
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 100, 'max_features': 0.5, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.8104395604395604:
CPU times: user 3.7 s, sys: 361 ms, total: 4.06 s
Wall time: 2min 35s
tuned_gbm = GradientBoostingClassifier(
random_state=1,
subsample=0.9,
n_estimators=100,
max_features=0.5,
learning_rate=0.1,
init=AdaBoostClassifier(random_state=1),
)
tuned_gbm.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
# Check model performance on training set
gbm_train = model_performance_classification_sklearn(
tuned_gbm, X_train, y_train
)
gbm_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.972 | 0.867 | 0.955 | 0.909 |
# Check model performance on validation set
gbm_val = model_performance_classification_sklearn(tuned_gbm, X_val, y_val)
gbm_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.968 | 0.862 | 0.937 | 0.898 |
%%time
# defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 75, 'max_features': 0.7, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9508267922553637:
CPU times: user 1.85 s, sys: 156 ms, total: 2.01 s
Wall time: 1min 7s
tuned_gbm1 = GradientBoostingClassifier(
random_state=1,
subsample=0.9,
n_estimators=75,
max_features=0.7,
learning_rate=0.1,
init=AdaBoostClassifier(random_state=1),
)
tuned_gbm1.fit(X_train_un, y_train_un)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.7, n_estimators=75, random_state=1,
                           subsample=0.9)
# Check model performance on training set
gbm1_train = model_performance_classification_sklearn(
tuned_gbm1, X_train_un, y_train_un
)
gbm1_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.970 | 0.977 | 0.964 | 0.970 |
# Checking model's performance on validation set
gbm1_val = model_performance_classification_sklearn(tuned_gbm1, X_val, y_val)
gbm1_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.938 | 0.957 | 0.738 | 0.833 |
%%time
#defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 75, 'max_features': 0.7, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9447041505512901:
CPU times: user 5.63 s, sys: 592 ms, total: 6.22 s
Wall time: 4min 10s
tuned_gbm2 = GradientBoostingClassifier(
random_state=1,
subsample=0.9,
n_estimators=75,
max_features=0.7,
learning_rate=0.1,
init=AdaBoostClassifier(random_state=1),
)
tuned_gbm2.fit(X_train_over, y_train_over)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.7, n_estimators=75, random_state=1,
                           subsample=0.9)
# Check model performance on training set
gbm2_train = model_performance_classification_sklearn(tuned_gbm2, X_train_over, y_train_over)
gbm2_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.973 | 0.977 | 0.968 | 0.973 |
# Check model performance on validation set
gbm2_val = model_performance_classification_sklearn(tuned_gbm2, X_val, y_val)
gbm2_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.952 | 0.887 | 0.826 | 0.855 |
Note: for reference, the parameter grids used for the candidate models (Gradient Boosting, AdaBoost, Bagging, Random Forest, Decision Tree, and XGBoost) are listed below.
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
param_grid = {
'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70],
}
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],
    "max_samples": np.arange(0.4, 0.7, 0.1),
}
param_grid = {
'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10, 15],
'min_impurity_decrease': [0.0001,0.001]
}
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.751941391941392:
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 4} with CV score=0.9111622313302161:
# defining model
Model = DecisionTreeClassifier(random_state=1)
# Parameter grid to pass in RandomizedSearchCV
param_grid = {'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10,15],
'min_impurity_decrease': [0.0001,0.001] }
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.8934432234432235:
# performance comparison for training set
models_train_comp_df = pd.concat(
[
gbm_train.T,
gbm1_train.T,
gbm2_train.T,
adb1_train.T,
adb_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Gradient boosting trained with Original data",
"Gradient boosting trained with Undersampled data",
"Gradient boosting trained with Oversampled data",
"AdaBoost trained with Undersampled data",
"AdaBoost trained with Original data",
]
print("Performance comparison for training set:")
models_train_comp_df
Performance comparison for training set:
| | Gradient boosting trained with Original data | Gradient boosting trained with Undersampled data | Gradient boosting trained with Oversampled data | AdaBoost trained with Undersampled data | AdaBoost trained with Original data |
|---|---|---|---|---|---|
| Accuracy | 0.972 | 0.970 | 0.973 | 0.973 | 0.982 |
| Recall | 0.867 | 0.977 | 0.977 | 0.978 | 0.927 |
| Precision | 0.955 | 0.964 | 0.968 | 0.968 | 0.961 |
| F1 | 0.909 | 0.970 | 0.973 | 0.973 | 0.944 |
# performance comparison for validation set
models_val_comp_df = pd.concat(
[
gbm_val.T,
gbm1_val.T,
gbm2_val.T,
adb1_val.T,
adb_val.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Gradient boosting trained with Original data",
"Gradient boosting trained with Undersampled data",
"Gradient boosting trained with Oversampled data",
"AdaBoost trained with Undersampled data",
"AdaBoost trained with Original data",
]
print("Performance comparison for validation set:")
models_val_comp_df
Performance comparison for validation set:
| | Gradient boosting trained with Original data | Gradient boosting trained with Undersampled data | Gradient boosting trained with Oversampled data | AdaBoost trained with Undersampled data | AdaBoost trained with Original data |
|---|---|---|---|---|---|
| Accuracy | 0.968 | 0.938 | 0.952 | 0.937 | 0.967 |
| Recall | 0.862 | 0.957 | 0.887 | 0.966 | 0.856 |
| Precision | 0.937 | 0.738 | 0.826 | 0.731 | 0.933 |
| F1 | 0.898 | 0.833 | 0.855 | 0.832 | 0.893 |
Gradient Boosting trained with the original data performed the best overall on the validation set, balancing high recall with high precision, so we will take it as our final model and check its performance on the test set.
# Let's check the performance on test set
gbm_test = model_performance_classification_sklearn(tuned_gbm, X_test, y_test)
gbm_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.970 | 0.874 | 0.937 | 0.904 |
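On the test set the tuned GBM recovers about 87% of actual attritors (recall 0.874) while keeping precision high (0.937). As a self-contained reminder of how these numbers relate to the confusion matrix (toy labels for illustration, not the bank data):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 1 = attrited customer, 0 = retained (illustration only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# sklearn's confusion_matrix layout is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(tp, fn)                           # 3 attritors caught, 1 missed
print(recall_score(y_true, y_pred))     # tp / (tp + fn) = 0.75
print(precision_score(y_true, y_pred))  # tp / (tp + fp) = 0.75
```

In business terms, recall is the share of true attritors the model flags in time for the bank to intervene, and precision is the share of flagged customers who were genuinely at risk.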
feature_names = X_train.columns
importances = tuned_gbm.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Total_Trans_Amt is the most important feature, followed by Total_Trans_Ct and Total_Revolving_Bal.